Conditional Image-Text Embedding Networks
This paper presents an approach for grounding phrases in images which jointly
learns multiple text-conditioned embeddings in a single end-to-end model. In
order to differentiate text phrases into semantically distinct subspaces, we
propose a concept weight branch that automatically assigns phrases to
embeddings, whereas prior works predefine such assignments. Our proposed
solution simplifies the representation requirements for individual embeddings
and allows the underrepresented concepts to take advantage of the shared
representations before feeding them into concept-specific layers. Comprehensive
experiments verify the effectiveness of our approach across three phrase
grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where
we obtain improvements of 4%, 3%, and 4%, respectively, in grounding
performance over a strong region-phrase embedding baseline.
Comment: ECCV 2018 accepted paper
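The concept weight branch can be pictured as a softmax gate over K concept-specific embeddings: the phrase is softly assigned to subspaces, and the final grounding score is the weighted sum of per-subspace similarities. The sketch below uses random matrices as stand-ins for the learned projections; all sizes and names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D_PHRASE, D_REGION, D_EMB = 4, 300, 512, 256   # illustrative sizes

# Stand-ins for learned parameters (random here; trained end-to-end in the paper).
W_phrase = rng.normal(size=(K, D_EMB, D_PHRASE)) * 0.01   # per-concept phrase projections
W_region = rng.normal(size=(K, D_EMB, D_REGION)) * 0.01   # per-concept region projections
W_concept = rng.normal(size=(K, D_PHRASE)) * 0.01          # concept weight branch

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grounding_score(phrase_feat, region_feat):
    """Weighted sum of per-concept embedding similarities.

    The concept weight branch assigns the phrase a soft distribution over
    the K embeddings, replacing the hand-made assignments of prior work.
    """
    weights = softmax(W_concept @ phrase_feat)     # (K,) soft concept assignment
    scores = np.empty(K)
    for k in range(K):
        p = W_phrase[k] @ phrase_feat              # phrase in concept subspace k
        r = W_region[k] @ region_feat              # region in concept subspace k
        scores[k] = p @ r / (np.linalg.norm(p) * np.linalg.norm(r))
    return float(weights @ scores)
```

Because the weights sum to one and each term is a cosine similarity, the combined score stays in [-1, 1].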
Deep Shape Matching
We cast shape matching as metric learning with convolutional networks. We
break the end-to-end process of image representation into two parts. Firstly,
well established efficient methods are chosen to turn the images into edge
maps. Secondly, the network is trained with edge maps of landmark images, which
are automatically obtained by a structure-from-motion pipeline. The learned
representation is evaluated on a range of different tasks, providing
improvements on challenging cases of domain generalization, generic
sketch-based image retrieval or its fine-grained counterpart. In contrast to
other methods that learn a different model per task, object category, or
domain, we use the same network throughout all our experiments, achieving
state-of-the-art results on multiple benchmarks.
Comment: ECCV 201
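The first stage (image to edge map) can use any well-established detector before the network sees the data. Below is a minimal Sobel gradient-magnitude sketch as an illustrative stand-in; it is not necessarily the edge detector the paper uses.

```python
import numpy as np

def sobel_edge_map(img):
    """Gradient-magnitude edge map: a simple stand-in for the efficient
    edge detectors plugged in before the descriptor network."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Valid interior only; borders stay zero.
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    mag = np.hypot(gx, gy)
    return mag / mag.max() if mag.max() > 0 else mag
```

A vertical intensity step produces a strong response along the boundary column and none in flat regions, which is the sparse, domain-neutral input the matching network is trained on.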
Unsupervised Monocular Depth Estimation for Night-time Images using Adversarial Domain Feature Adaptation
In this paper, we look into the problem of estimating per-pixel depth maps
from unconstrained RGB monocular night-time images, a difficult task that has
not been addressed adequately in the literature. State-of-the-art day-time
depth estimation methods fail miserably when tested on night-time images due
to the large domain shift between the two domains. The usual photometric
losses used for training these networks may not work for night-time images
because the uniform lighting commonly present in day-time images is absent,
making the problem difficult to solve. We propose to solve this problem by
posing it as a domain adaptation problem where a network trained with day-time
images is adapted to work for night-time images. Specifically, an encoder is
trained to generate features from night-time images that are indistinguishable
from those obtained from day-time images by using a PatchGAN-based adversarial
discriminative learning method. Unlike the existing methods that directly adapt
depth prediction (network output), we propose to adapt feature maps obtained
from the encoder network so that a pre-trained day-time depth decoder can be
directly used for predicting depth from these adapted features. Hence, the
resulting method is termed as "Adversarial Domain Feature Adaptation (ADFA)"
and its efficacy is demonstrated through experimentation on the challenging
Oxford night driving dataset. Also, the modular encoder-decoder architecture
for the proposed ADFA method allows us to use the encoder module as a feature
extractor which can be used in many other applications. One such application is
demonstrated where the features obtained from our adapted encoder network are
shown to outperform other state-of-the-art methods in a visual place
recognition problem, thereby further establishing the usefulness and
effectiveness of the proposed approach.
Comment: ECCV 202
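The adversarial adaptation objective can be pictured as two standard GAN-style losses over the discriminator's patch logits: the discriminator learns to tell day-time feature patches (real) from night-time ones (fake), while the night-time encoder is updated to fool it. The functions below are an illustrative sketch of those two terms, not the authors' implementation.

```python
import numpy as np

def bce_logits(logits, target):
    """Numerically stable binary cross-entropy with logits."""
    logits = np.asarray(logits, dtype=float)
    return float(np.mean(np.maximum(logits, 0) - logits * target
                         + np.log1p(np.exp(-np.abs(logits)))))

def discriminator_loss(d_day_logits, d_night_logits):
    """PatchGAN-style discriminator objective: label day-time feature
    patches real (1) and night-time patches fake (0)."""
    return bce_logits(d_day_logits, 1.0) + bce_logits(d_night_logits, 0.0)

def encoder_adversarial_loss(d_night_logits):
    """Night-time encoder objective: make its features be classified as
    day-time (real) by the discriminator, so the frozen day-time depth
    decoder can consume them directly."""
    return bce_logits(d_night_logits, 1.0)
```

When the discriminator confidently separates the domains, its own loss is small while the encoder's adversarial loss is large, which is exactly the gradient signal that pushes the night-time features toward the day-time distribution.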
GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints
Learned local descriptors based on Convolutional Neural Networks (CNNs) have
achieved significant improvements on patch-based benchmarks, but have not
demonstrated strong generalization ability on recent benchmarks of image-based
3D reconstruction. In this paper, we mitigate this limitation by proposing a
novel local descriptor learning approach that integrates geometry constraints
from multi-view reconstructions, which benefits the learning process in terms
of data generation, data sampling and loss computation. We refer to the
proposed descriptor as GeoDesc, and demonstrate its superior performance on
various large-scale benchmarks, and in particular show its great success on
challenging reconstruction tasks. Moreover, we provide guidelines towards
practical integration of learned descriptors in Structure-from-Motion (SfM)
pipelines, showing the good trade-off between accuracy and efficiency that
GeoDesc delivers to 3D reconstruction tasks.
Comment: Accepted to ECCV'1
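One common way geometry-verified correspondences enter descriptor learning is as triplets: a patch, a positive from the same SfM track, and a negative from a different track. The sketch below shows a generic triplet margin loss of the kind such sampling feeds; it is illustrative of the setup, not GeoDesc's exact loss formulation.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Generic descriptor triplet loss. 'positive' would come from a
    geometry-verified multi-view correspondence, 'negative' from a
    non-matching track (the margin value here is arbitrary)."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to matching patch
    d_neg = np.linalg.norm(anchor - negative)   # distance to non-matching patch
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is pushed a margin farther away than the positive; geometry constraints matter because they determine which pairs count as positives and which negatives are hard enough to be informative.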
Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval
Optimising a ranking-based metric, such as Average Precision (AP), is
notoriously challenging due to the fact that it is non-differentiable, and
hence cannot be optimised directly using gradient-descent methods. To this end,
we introduce an objective that optimises instead a smoothed approximation of
AP, coined Smooth-AP. Smooth-AP is a plug-and-play objective function that
allows for end-to-end training of deep networks with a simple and elegant
implementation. We also present an analysis of why directly optimising the
ranking-based metric of AP offers benefits over other deep metric learning
losses. We apply Smooth-AP to standard retrieval benchmarks: Stanford Online
Products and VehicleID, and also evaluate on larger-scale datasets: iNaturalist
for fine-grained category retrieval, and VGGFace2 and IJB-C for face retrieval.
In all cases, we improve the performance over the state-of-the-art, especially
for larger-scale datasets, thus demonstrating the effectiveness and scalability
of Smooth-AP to real-world scenarios.
Comment: Accepted at ECCV 202
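The core idea — replacing the non-differentiable 0/1 ranking indicator inside AP with a temperature-controlled sigmoid — can be sketched for a single query as below. This is a minimal numpy version to show the mechanism; batching, mini-batch sampling, and the exact formulation follow the paper.

```python
import numpy as np

def smooth_ap(scores, labels, tau=0.01):
    """Smoothed Average Precision for one query.

    scores: (N,) similarity of each gallery item to the query.
    labels: (N,) 1 for relevant items, 0 otherwise.
    tau:    temperature; as tau -> 0 the value approaches exact AP.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # D[i, j] = (s_j - s_i) / tau; the sigmoid softly counts items ranked
    # above item i (clipped to keep exp() in a safe range).
    d = np.clip((scores[None, :] - scores[:, None]) / tau, -50.0, 50.0)
    sig = 1.0 / (1.0 + np.exp(-d))
    np.fill_diagonal(sig, 0.0)                            # no self-comparison
    rank_all = 1.0 + sig.sum(axis=1)                      # soft rank among all
    rank_pos = 1.0 + (sig * labels[None, :]).sum(axis=1)  # soft rank among positives
    pos = labels == 1
    return float(np.mean(rank_pos[pos] / rank_all[pos]))
```

Unlike exact AP, every term here is differentiable in the scores, so the whole expression can sit at the end of a deep network and be maximised by gradient descent.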
Compact Deep Aggregation for Set Retrieval
The objective of this work is to learn a compact embedding of a set of
descriptors that is suitable for efficient retrieval and ranking, whilst
maintaining discriminability of the individual descriptors. We focus on a
specific example of this general problem -- that of retrieving images
containing multiple faces from a large scale dataset of images. Here the set
consists of the face descriptors in each image, and given a query for multiple
identities, the goal is then to retrieve, in order, images which contain all
the identities, all but one, etc.
To this end, we make the following contributions: first, we propose a CNN
architecture -- SetNet -- to achieve the objective: it learns face
descriptors and their aggregation over a set to produce a compact fixed length
descriptor designed for set retrieval, and the score of an image is a count of
the number of identities that match the query; second, we show that this
compact descriptor has minimal loss of discriminability up to two faces per
image, and degrades slowly after that -- far exceeding a number of baselines;
third, we explore the speed vs. retrieval quality trade-off for set retrieval
using this compact descriptor; and, finally, we collect and annotate a large
dataset of images containing varying numbers of celebrities, which we use for
evaluation and publicly release.
Comment: 20 pages
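The set-retrieval scoring can be pictured in two steps: aggregate the per-face descriptors of an image into one compact vector, then count how many queried identities that vector matches. The sketch below uses plain mean-pooling and a similarity threshold as crude, clearly-labeled stand-ins for SetNet's learned aggregation and the paper's exact scoring rule.

```python
import numpy as np

def aggregate(face_descs):
    """Compact fixed-length set descriptor: mean-pool the per-face
    descriptors, then L2-normalise. (A crude stand-in for SetNet's
    learned aggregation layer.)"""
    v = np.mean(face_descs, axis=0)
    return v / np.linalg.norm(v)

def set_score(set_desc, query_descs, thresh=0.2):
    """Image score for a multi-identity query: the number of queried
    identity descriptors whose similarity to the compact set descriptor
    clears a threshold (an illustrative choice of rule)."""
    sims = query_descs @ set_desc
    return int(np.sum(sims > thresh))
```

With orthonormal identity descriptors, an image containing two of three queried identities scores 2, so ranking by this score returns images with all identities first, then all but one, and so on.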
VPR-Bench: An Open-Source Visual Place Recognition Evaluation Framework with Quantifiable Viewpoint and Appearance Change
Visual place recognition (VPR) is the process of recognising a previously visited place using visual information, often under varying appearance conditions and viewpoint changes and with computational constraints. VPR is related to the concepts of localisation, loop closure, and image retrieval, and is a critical component of many autonomous navigation systems ranging from autonomous vehicles to drones and computer vision systems. While the concept of place recognition has been around for many years, VPR research has grown rapidly as a field over the past decade due to improving camera hardware and the potential of deep learning-based techniques, and has become a widely studied topic in both the computer vision and robotics communities. This growth, however, has led to fragmentation and a lack of standardisation in the field, especially concerning performance evaluation. Moreover, the notion of viewpoint and illumination invariance of VPR techniques has largely been assessed qualitatively, and hence ambiguously, in the past. In this paper, we address these gaps through a new comprehensive open-source framework for assessing the performance of VPR techniques, dubbed "VPR-Bench". VPR-Bench (open-sourced at https://github.com/MubarizZaffar/VPR-Bench) introduces two much-needed capabilities for VPR researchers: firstly, it contains a benchmark of 12 fully-integrated datasets and 10 VPR techniques, and secondly, it integrates a comprehensive variation-quantified dataset for quantifying viewpoint and illumination invariance. We apply and analyse popular evaluation metrics for VPR from both the computer vision and robotics communities, and discuss how these different metrics complement and/or replace each other, depending upon the underlying applications and system requirements.
Our analysis reveals that no universal state-of-the-art (SOTA) VPR technique exists, since: (a) SOTA performance is achieved by 8 of the 10 techniques on at least one dataset, and (b) the SOTA technique in one community does not necessarily yield SOTA performance in the other, given the differences in datasets and metrics. Furthermore, we identify key open challenges, since: (c) all 10 techniques suffer greatly in perceptually-aliased and less-structured environments, (d) all techniques suffer from viewpoint variance, where lateral change has less effect than 3D change, and (e) directional illumination change has more adverse effects on matching confidence than uniform illumination change. We also present detailed meta-analyses regarding the roles of varying ground-truths, platforms, application requirements and technique parameters. Finally, VPR-Bench provides a unified implementation to deploy these VPR techniques, metrics and datasets, and is extensible through templates.
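One of the standard evaluation tools such a framework applies is a precision-recall curve swept over a descending match-confidence threshold, which lets techniques whose confidences live on different scales be compared. A minimal sketch of that computation (function name and interface are illustrative, not VPR-Bench's API):

```python
import numpy as np

def precision_recall(scores, correct):
    """Precision-recall points over a descending confidence threshold.

    scores:  match confidence for each query's retrieved place.
    correct: 1 if that retrieval is a true place match, else 0.
    Returns one (precision, recall) pair per threshold position.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-scores)                 # most confident first
    correct = correct[order]
    tp = np.cumsum(correct)                     # true positives so far
    precision = tp / np.arange(1, len(correct) + 1)
    recall = tp / correct.sum()
    return precision, recall
```

Metrics like AUC-PR or precision at 100% recall then fall out of these two arrays, which is one reason threshold-swept curves are a common meeting point between the vision and robotics evaluation traditions.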